This project uses supervised learning to predict the type of a dry bean.
We use a dataset containing around 13,600 samples. Each sample has 16 numeric shape attributes: Area, Perimeter, MajorAxisLength, MinorAxisLength, AspectRation, Eccentricity, ConvexArea, EquivDiameter, Extent, Solidity, roundness, Compactness, and ShapeFactor1 through ShapeFactor4.
As for bean types, there are 7 possible types: Seker, Barbunya, Bombay, Cali, Horoz, Sira, and Dermason.
Taking advantage of these attributes, we will train and compare several classifiers to predict the most likely bean type for each sample, aiming for a solid prediction score.
#Packages used in this notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
import numpy as np
def histograms(className):
    """Plot one histogram per numeric attribute for the given bean class."""
    plt.figure(figsize=(20, 20))
    for column_index, columnName in enumerate(dry_beans_data.columns):
        if columnName == 'Class':
            continue
        plt.subplot(4, 4, column_index + 1)
        dry_beans_data.loc[dry_beans_data['Class'] == className, columnName].hist(legend=True)

def findDuplicates(className):
    """Count and print the number of duplicated rows within one class."""
    cnt = dry_beans_data.loc[dry_beans_data['Class'] == className].duplicated().sum()
    print("Repeated %s: %d" % (className, cnt))

def gridSearchScore(technique, parameter_grid, cross_validation, labels, inputs):
    """Run a grid search and report the best score and parameters."""
    grid_search = GridSearchCV(technique,
                               param_grid=parameter_grid,
                               cv=cross_validation)
    grid_search.fit(inputs, labels)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    return grid_search

def generateMetrics(labels, pred):
    """Print accuracy, weighted precision, weighted recall, and weighted F1."""
    print('Accuracy: ', accuracy_score(labels, pred))
    print('Precision: ', precision_score(labels, pred, average="weighted"))
    print('Recall: ', recall_score(labels, pred, average="weighted"))
    print('F1: ', f1_score(labels, pred, average="weighted"))

def confusionMatrix(grid_search, inputs, labels):
    """Plot the confusion matrix of the best estimator and return its predictions."""
    dct = grid_search.best_estimator_
    pred = dct.predict(inputs)
    fig, ax = plt.subplots(figsize=(10, 10))
    cm = confusion_matrix(labels, pred, labels=dct.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dct.classes_)
    disp.plot(ax=ax)
    return pred
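As a quick sanity check of the weighted metrics that `generateMetrics` prints, here is a minimal sketch on a tiny hand-made label set (the labels below are illustrative only, not drawn from the dataset):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground truth and predictions: 4 of 5 samples classified correctly
labels = ['SEKER', 'SEKER', 'HOROZ', 'HOROZ', 'SIRA']
pred   = ['SEKER', 'HOROZ', 'HOROZ', 'HOROZ', 'SIRA']

acc = accuracy_score(labels, pred)
print('Accuracy: ', acc)  # 0.8
# "weighted" averages the per-class scores weighted by each class's support
print('Precision: ', precision_score(labels, pred, average="weighted"))
print('Recall: ', recall_score(labels, pred, average="weighted"))
print('F1: ', f1_score(labels, pred, average="weighted"))
```

The `weighted` average matters here because the bean classes have different sample counts; a plain macro average would weight all seven classes equally.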
#Reading of data
dry_beans_data = pd.read_csv('Dry_Bean_Dataset.csv', index_col=0)
#Describing data
dry_beans_data.describe()
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 |
| mean | 53048.284549 | 855.283459 | 320.141867 | 202.270714 | 1.583242 | 0.750895 | 53768.200206 | 253.064220 | 0.749733 | 0.987143 | 0.873282 | 0.799864 | 0.006564 | 0.001716 | 0.643590 | 0.995063 |
| std | 29324.095717 | 214.289696 | 85.694186 | 44.970091 | 0.246678 | 0.092002 | 29774.915817 | 59.177120 | 0.049086 | 0.004660 | 0.059520 | 0.061713 | 0.001128 | 0.000596 | 0.098996 | 0.004366 |
| min | 20420.000000 | 524.736000 | 183.601165 | 122.512653 | 1.024868 | 0.218951 | 20684.000000 | 161.243764 | 0.555315 | 0.919246 | 0.489618 | 0.640577 | 0.002778 | 0.000564 | 0.410339 | 0.947687 |
| 25% | 36328.000000 | 703.523500 | 253.303633 | 175.848170 | 1.432307 | 0.715928 | 36714.500000 | 215.068003 | 0.718634 | 0.985670 | 0.832096 | 0.762469 | 0.005900 | 0.001154 | 0.581359 | 0.993703 |
| 50% | 44652.000000 | 794.941000 | 296.883367 | 192.431733 | 1.551124 | 0.764441 | 45178.000000 | 238.438026 | 0.759859 | 0.988283 | 0.883157 | 0.801277 | 0.006645 | 0.001694 | 0.642044 | 0.996386 |
| 75% | 61332.000000 | 977.213000 | 376.495012 | 217.031741 | 1.707109 | 0.810466 | 62294.000000 | 279.446467 | 0.786851 | 0.990013 | 0.916869 | 0.834270 | 0.007271 | 0.002170 | 0.696006 | 0.997883 |
| max | 254616.000000 | 1985.370000 | 738.860153 | 460.198497 | 2.430306 | 0.911423 | 263261.000000 | 569.374358 | 0.866195 | 0.994677 | 0.990685 | 0.987303 | 0.010451 | 0.003665 | 0.974767 | 0.999733 |
sb.pairplot(dry_beans_data.dropna(), hue='Class');
Finding null or missing (NA) values
dry_beans_data.loc[dry_beans_data['Area'].isnull() | dry_beans_data['Perimeter'].isnull() |
                   dry_beans_data['MajorAxisLength'].isnull() | dry_beans_data['MinorAxisLength'].isnull() |
                   dry_beans_data['AspectRation'].isnull() | dry_beans_data['Eccentricity'].isnull() |
                   dry_beans_data['ConvexArea'].isnull() | dry_beans_data['EquivDiameter'].isnull() |
                   dry_beans_data['Solidity'].isnull() | dry_beans_data['Extent'].isnull() |
                   dry_beans_data['roundness'].isnull() | dry_beans_data['Compactness'].isnull() |
                   dry_beans_data['ShapeFactor1'].isnull() | dry_beans_data['ShapeFactor2'].isnull() |
                   dry_beans_data['ShapeFactor3'].isnull() | dry_beans_data['ShapeFactor4'].isnull()
                   ]
The query returns an empty DataFrame: the dataset contains no missing values. (`isna` is an alias of `isnull` in pandas, so an equivalent check with `isna` returns the same empty result.)
SEKER
histograms('SEKER');
From the histograms above, we can see that in the Solidity histogram of the Seker bean, most beans have a value of 0.98 or greater. We therefore treated values below this threshold as outliers and removed them; only 14 of the 2027 Seker beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER', 'Solidity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Solidity'] >= 0.98) | (dry_beans_data['Class'] != 'SEKER')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER', 'Solidity'].hist(legend=True)
In the ShapeFactor4 histogram of the Seker bean, most beans have a value of 0.996 or greater. Values below this threshold were treated as outliers and removed; only 39 of the 2013 remaining Seker beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER', 'ShapeFactor4'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor4'] >= 0.996) | (dry_beans_data['Class'] != 'SEKER')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER', 'ShapeFactor4'].hist(legend=True)
In the roundness histogram of the Seker bean, most beans have a value of 0.88 or greater. Values below this threshold were treated as outliers and removed; only 48 of the 1974 remaining Seker beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER', 'roundness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] >= 0.88) | (dry_beans_data['Class'] != 'SEKER')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER', 'roundness'].hist(legend=True)
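The per-class threshold filtering above repeats the same pattern for every attribute, so it can be captured in a small helper. `remove_outliers` is a hypothetical name, not part of this notebook; a minimal sketch:

```python
import pandas as pd

def remove_outliers(df, class_name, column, lower=None, upper=None):
    """Keep rows of `class_name` only if `column` lies within [lower, upper];
    rows of all other classes are always kept."""
    keep_other = df['Class'] != class_name
    in_range = pd.Series(True, index=df.index)
    if lower is not None:
        in_range &= df[column] >= lower
    if upper is not None:
        in_range &= df[column] <= upper
    return df.loc[keep_other | in_range]

# Toy demonstration: drop the one SEKER row whose Solidity is below 0.98
toy = pd.DataFrame({'Class': ['SEKER', 'SEKER', 'CALI'],
                    'Solidity': [0.99, 0.95, 0.95]})
filtered = remove_outliers(toy, 'SEKER', 'Solidity', lower=0.98)
print(len(filtered))  # 2
```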
BARBUNYA
histograms('BARBUNYA');
In the ShapeFactor4 histogram of the Barbunya bean, most beans have a value of 0.988 or greater. Values below this threshold were treated as outliers and removed; only 24 of the 1322 Barbunya beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BARBUNYA', 'ShapeFactor4'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor4'] >= 0.988) | (dry_beans_data['Class'] != 'BARBUNYA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BARBUNYA', 'ShapeFactor4'].hist(legend=True)
In the Eccentricity histogram of the Barbunya bean, most beans have a value of 0.6 or greater. Values below this threshold were treated as outliers and removed; only 14 of the 1298 remaining Barbunya beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BARBUNYA', 'Eccentricity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Eccentricity'] >= 0.6) | (dry_beans_data['Class'] != 'BARBUNYA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BARBUNYA', 'Eccentricity'].hist(legend=True)
BOMBAY
histograms('BOMBAY');
In the Solidity histogram of the Bombay bean, most beans have a value of 0.973 or greater. Values below this threshold were treated as outliers and removed; only 14 of the 522 Bombay beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BOMBAY', 'Solidity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Solidity'] >= 0.973) | (dry_beans_data['Class'] != 'BOMBAY')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BOMBAY', 'Solidity'].hist(legend=True)
In the roundness histogram of the Bombay bean, most beans fall between 0.8 and 0.925. Values outside this range were treated as outliers and removed; only 8 of the 508 remaining Bombay beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BOMBAY', 'roundness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] >= 0.8) | (dry_beans_data['Class'] != 'BOMBAY')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] <= 0.925) | (dry_beans_data['Class'] != 'BOMBAY')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'BOMBAY', 'roundness'].hist(legend=True)
CALI
histograms('CALI')
In the Perimeter histogram of the Cali bean, most beans fall between 900 and 1225. Values outside this range were treated as outliers and removed; only 35 of the 1630 Cali beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'Perimeter'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Perimeter'] >= 900) | (dry_beans_data['Class'] != 'CALI')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Perimeter'] <= 1225) | (dry_beans_data['Class'] != 'CALI')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'Perimeter'].hist(legend=True)
In the Eccentricity histogram of the Cali bean, most beans have a value of 0.76 or greater. Values below this threshold were treated as outliers and removed; only 25 of the 1595 remaining Cali beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'Eccentricity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Eccentricity'] >= 0.76) | (dry_beans_data['Class'] != 'CALI')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'Eccentricity'].hist(legend=True)
In the Compactness histogram of the Cali bean, most beans have a value of 0.71 or greater. Values below this threshold were treated as outliers and removed; only 13 of the 1570 remaining Cali beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'Compactness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Compactness'] >= 0.71) | (dry_beans_data['Class'] != 'CALI')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'Compactness'].hist(legend=True)
In the roundness histogram of the Cali bean, most beans fall between 0.78 and 0.895. Values outside this range were treated as outliers and removed; only 21 of the 1557 remaining Cali beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'roundness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] >= 0.78) | (dry_beans_data['Class'] != 'CALI')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] <= 0.895) | (dry_beans_data['Class'] != 'CALI')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'roundness'].hist(legend=True)
In the ShapeFactor2 histogram of the Cali bean, most beans fall between 0.00088 and 0.0014. Values outside this range were treated as outliers and removed; only 19 of the 1536 remaining Cali beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'ShapeFactor2'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor2'] >= 0.00088) | (dry_beans_data['Class'] != 'CALI')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor2'] <= 0.0014) | (dry_beans_data['Class'] != 'CALI')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'ShapeFactor2'].hist(legend=True)
In the ShapeFactor4 histogram of the Cali bean, most beans have a value of 0.978 or greater. Values below this threshold were treated as outliers and removed; only 18 of the 1517 remaining Cali beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'ShapeFactor4'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor4'] >= 0.978) | (dry_beans_data['Class'] != 'CALI')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'CALI', 'ShapeFactor4'].hist(legend=True)
HOROZ
histograms('HOROZ')
In the Eccentricity histogram of the Horoz bean, most beans have a value of 0.805 or greater. Values below this threshold were treated as outliers and removed; only 29 of the 1928 Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'Eccentricity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Eccentricity'] >= 0.805) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'Eccentricity'].hist(legend=True)
In the Solidity histogram of the Horoz bean, most beans have a value of 0.97 or greater. Values below this threshold were treated as outliers and removed; only 65 of the 1899 remaining Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'Solidity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Solidity'] >= 0.97) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'Solidity'].hist(legend=True)
In the roundness histogram of the Horoz bean, most beans fall between 0.72 and 0.86. Values outside this range were treated as outliers and removed; only 36 of the 1834 remaining Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'roundness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] >= 0.72) | (dry_beans_data['Class'] != 'HOROZ')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] <= 0.86) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'roundness'].hist(legend=True)
In the Compactness histogram of the Horoz bean, most beans have a value of 0.655 or greater. Values below this threshold were treated as outliers and removed; only 16 of the 1798 remaining Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'Compactness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Compactness'] >= 0.655) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'Compactness'].hist(legend=True)
In the MinorAxisLength histogram of the Horoz bean, most beans have a value of 210 or less. Values above this threshold were treated as outliers and removed; only 32 of the 1782 remaining Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'MinorAxisLength'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['MinorAxisLength'] <= 210) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'MinorAxisLength'].hist(legend=True)
In the EquivDiameter histogram of the Horoz bean, most beans have a value of 296 or less. Values above this threshold were treated as outliers and removed; only 12 of the 1750 remaining Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'EquivDiameter'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['EquivDiameter'] <= 296) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'EquivDiameter'].hist(legend=True)
In the ShapeFactor2 histogram of the Horoz bean, most beans have a value of 0.00139 or less. Values above this threshold were treated as outliers and removed; only 36 of the 1738 remaining Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'ShapeFactor2'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor2'] <= 0.00139) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'ShapeFactor2'].hist(legend=True)
In the ShapeFactor4 histogram of the Horoz bean, most beans have a value of 0.98 or greater. Values below this threshold were treated as outliers and removed; only 37 of the 1702 remaining Horoz beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'ShapeFactor4'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor4'] >= 0.98) | (dry_beans_data['Class'] != 'HOROZ')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ', 'ShapeFactor4'].hist(legend=True)
SIRA
histograms('SIRA')
In the Perimeter histogram of the Sira bean, most beans fall between 700 and 890. Values outside this range were treated as outliers and removed; only 82 of the 2636 Sira beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'Perimeter'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Perimeter'] <= 890) | (dry_beans_data['Class'] != 'SIRA')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Perimeter'] >= 700) | (dry_beans_data['Class'] != 'SIRA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'Perimeter'].hist(legend=True)
In the MajorAxisLength histogram of the Sira bean, most beans fall between 259 and 342. Values outside this range were treated as outliers and removed; only 44 of the 2554 remaining Sira beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'MajorAxisLength'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['MajorAxisLength'] <= 342) | (dry_beans_data['Class'] != 'SIRA')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['MajorAxisLength'] >= 259) | (dry_beans_data['Class'] != 'SIRA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'MajorAxisLength'].hist(legend=True)
In the Solidity histogram of the Sira bean, most beans have a value of 0.98 or greater. Values below this threshold were treated as outliers and removed; only 41 of the 2510 remaining Sira beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'Solidity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Solidity'] >= 0.98) | (dry_beans_data['Class'] != 'SIRA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'Solidity'].hist(legend=True)
In the roundness histogram of the Sira bean, most beans fall between 0.82 and 0.93. Values outside this range were treated as outliers and removed; only 32 of the 2469 remaining Sira beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'roundness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] >= 0.82) | (dry_beans_data['Class'] != 'SIRA')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] <= 0.93) | (dry_beans_data['Class'] != 'SIRA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'roundness'].hist(legend=True)
In the Compactness histogram of the Sira bean, most beans have a value of 0.85 or less. Values above this threshold were treated as outliers and removed; only 18 of the 2437 remaining Sira beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'Compactness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Compactness'] <= 0.85) | (dry_beans_data['Class'] != 'SIRA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'Compactness'].hist(legend=True)
In the ShapeFactor4 histogram of the Sira bean, most beans have a value of 0.988 or greater. Values below this threshold were treated as outliers and removed; only 17 of the 2419 remaining Sira beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'ShapeFactor4'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor4'] >= 0.988) | (dry_beans_data['Class'] != 'SIRA')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA', 'ShapeFactor4'].hist(legend=True)
DERMASON
histograms('DERMASON')
In the Solidity histogram of the Dermason bean, most beans have a value of 0.98 or greater. Values below this threshold were treated as outliers and removed; only 61 of the 3546 Dermason beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'Solidity'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['Solidity'] >= 0.98) | (dry_beans_data['Class'] != 'DERMASON')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'Solidity'].hist(legend=True)
In the AspectRation histogram of the Dermason bean, most beans fall between 1.27 and 1.75. Values outside this range were treated as outliers and removed; only 56 of the 3485 remaining Dermason beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'AspectRation'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['AspectRation'] >= 1.27) | (dry_beans_data['Class'] != 'DERMASON')]
dry_beans_data = dry_beans_data.loc[(dry_beans_data['AspectRation'] <= 1.75) | (dry_beans_data['Class'] != 'DERMASON')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'AspectRation'].hist(legend=True)
In the ShapeFactor4 histogram of the Dermason bean, most beans have a value of 0.9915 or greater. Values below this threshold were treated as outliers and removed; only 43 of the 3429 remaining Dermason beans were affected.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'ShapeFactor4'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['ShapeFactor4'] >= 0.9915) | (dry_beans_data['Class'] != 'DERMASON')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'ShapeFactor4'].hist(legend=True)
Finally, in the roundness histogram of the Dermason bean, most beans have a value of 0.85 or greater. The 39 of the 3386 remaining Dermason beans below this threshold were treated as outliers and removed.
#Before removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'roundness'].hist(legend=True)
dry_beans_data = dry_beans_data.loc[(dry_beans_data['roundness'] >= 0.85) | (dry_beans_data['Class'] != 'DERMASON')]
#After removing outliers
dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON', 'roundness'].hist(legend=True)
# Number of duplicate elements of each type
findDuplicates('SEKER');
findDuplicates('BARBUNYA');
findDuplicates('BOMBAY');
findDuplicates('CALI');
findDuplicates('HOROZ');
findDuplicates('SIRA');
findDuplicates('DERMASON');
Repeated SEKER: 0 Repeated BARBUNYA: 0 Repeated BOMBAY: 0 Repeated CALI: 0 Repeated HOROZ: 62 Repeated SIRA: 0 Repeated DERMASON: 0
As we can see, there are 62 duplicated Horoz samples, so we need to remove them.
# Eliminating duplicated elements
dry_beans_data = dry_beans_data.drop_duplicates()
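`drop_duplicates` keeps the first occurrence of each fully identical row and discards the rest; a toy demonstration, independent of the bean data:

```python
import pandas as pd

# Rows 0 and 1 are identical across all columns, so one of them is dropped
df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```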
# Number of duplicate elements of each type
findDuplicates('SEKER');
findDuplicates('BARBUNYA');
findDuplicates('BOMBAY');
findDuplicates('CALI');
findDuplicates('HOROZ');
findDuplicates('SIRA');
findDuplicates('DERMASON');
Repeated SEKER: 0 Repeated BARBUNYA: 0 Repeated BOMBAY: 0 Repeated CALI: 0 Repeated HOROZ: 0 Repeated SIRA: 0 Repeated DERMASON: 0
All duplicated values were removed.
# Number of elements of each class
print("SEKER: %d" % len(dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER']))
print("BARBUNYA: %d" % len(dry_beans_data.loc[dry_beans_data['Class'] == 'BARBUNYA']))
print("BOMBAY: %d" % len(dry_beans_data.loc[dry_beans_data['Class'] == 'BOMBAY']))
print("CALI: %d" % len(dry_beans_data.loc[dry_beans_data['Class'] == 'CALI']))
print("HOROZ: %d" % len(dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ']))
print("SIRA: %d" % len(dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA']))
print("DERMASON: %d" % len(dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON']))
SEKER: 1926 BARBUNYA: 1284 BOMBAY: 500 CALI: 1499 HOROZ: 1603 SIRA: 2402 DERMASON: 3347
As we can see, the bean classes are not balanced. Since the smallest class (Bombay) has 500 samples, we decided to draw 450 samples from each class.
# Select 450 elements of each type
sample_size = 450
seker_sample = dry_beans_data.loc[dry_beans_data['Class'] == 'SEKER'].sample(n = sample_size)
barbunya_sample = dry_beans_data.loc[dry_beans_data['Class'] == 'BARBUNYA'].sample(n = sample_size)
bombay_sample = dry_beans_data.loc[dry_beans_data['Class'] == 'BOMBAY'].sample(n = sample_size)
cali_sample = dry_beans_data.loc[dry_beans_data['Class'] == 'CALI'].sample(n = sample_size)
horoz_sample = dry_beans_data.loc[dry_beans_data['Class'] == 'HOROZ'].sample(n = sample_size)
sira_sample = dry_beans_data.loc[dry_beans_data['Class'] == 'SIRA'].sample(n = sample_size)
dermason_sample = dry_beans_data.loc[dry_beans_data['Class'] == 'DERMASON'].sample(n = sample_size)
sample = pd.concat([seker_sample, barbunya_sample, bombay_sample, cali_sample, horoz_sample, sira_sample, dermason_sample])
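The seven per-class draws above can also be written as a single `groupby` (available since pandas 1.1). A sketch, assuming the same `dry_beans_data` frame; the fixed seed is our addition to make the draw reproducible:

```python
import pandas as pd

def balanced_sample(df, class_column='Class', n=450, seed=1):
    # Draw n rows from every class; random_state makes the draw reproducible.
    return df.groupby(class_column, group_keys=False).sample(n=n, random_state=seed)

# sample = balanced_sample(dry_beans_data)
```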
#sb.pairplot(sample.dropna(), hue='Class')
plt.figure(figsize=(20, 50))
for column_index, column in enumerate(sample.columns):
if column == 'Class':
continue
plt.subplot(8, 2, column_index + 1)
sb.violinplot(x='Class', y=column, data=sample)
dry_beans_data1 = sample[['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter',
'ShapeFactor4', 'Solidity', 'Extent', 'roundness', 'Compactness',
'ShapeFactor1', 'ShapeFactor2', 'ShapeFactor3', 'Class']]
inputs = dry_beans_data1[['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter',
'ShapeFactor4', 'Solidity', 'Extent', 'roundness', 'Compactness',
'ShapeFactor1', 'ShapeFactor2', 'ShapeFactor3']].values
labels = dry_beans_data1['Class'].values
(training_inputs, testing_inputs, training_classes, testing_classes) = train_test_split(inputs, labels, test_size = 0.25, random_state = 1)
decision_tree_classifier = DecisionTreeClassifier() # classifier
decision_tree_classifier.fit(training_inputs, training_classes) # fit the model
decision_tree_classifier.score(testing_inputs, testing_classes)
0.9213197969543148
model_accuracies = []
for repetition in range(100):
(training_inputs,
testing_inputs,
training_classes,
testing_classes) = train_test_split(inputs, labels, test_size=0.25)
decision_tree_classifier = DecisionTreeClassifier()
decision_tree_classifier.fit(training_inputs, training_classes)
classifier_accuracy = decision_tree_classifier.score(testing_inputs, testing_classes)
model_accuracies.append(classifier_accuracy)
plt.hist(model_accuracies)
(array([ 3., 2., 11., 7., 13., 29., 14., 7., 11., 3.]),
array([0.89593909, 0.90025381, 0.90456853, 0.90888325, 0.91319797,
0.91751269, 0.92182741, 0.92614213, 0.93045685, 0.93477157,
0.93908629]),
<BarContainer object of 10 artists>)
#Using cross validation
decision_tree = DecisionTreeClassifier()
cv_scores = cross_val_score(decision_tree, inputs, labels, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
Text(0.5, 1.0, 'Average score: 0.915873015873016')
decision_tree_classifier = DecisionTreeClassifier(max_depth=6)
cv_scores = cross_val_score(decision_tree_classifier, inputs, labels, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
Text(0.5, 1.0, 'Average score: 0.9244444444444445')
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],
'max_features': [1, 2, 3, 4, 5, 6, 7, 8]}
cross_validation = StratifiedKFold(n_splits=10)
grid_search = gridSearchScore(decision_tree_classifier, parameter_grid, cross_validation, labels, inputs) # gridSearchScore must return the fitted GridSearchCV for the cells below
Best score: 0.9244444444444445
Best parameters: {'max_depth': 8, 'max_features': 5}
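The cells below rely on three helpers whose definitions are not shown in full: `gridSearchScore` (which, given its outputs and later use, must also print the best parameters and return the fitted search), `confusionMatrix`, and `generateMetrics`. A minimal sketch consistent with how they are used; the exact plotting details are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay)

def gridSearchScore(technique, parameter_grid, cross_validation, labels, inputs):
    # Fit a grid search and report the best score and parameter combination.
    grid_search = GridSearchCV(technique, param_grid=parameter_grid, cv=cross_validation)
    grid_search.fit(inputs, labels)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    return grid_search  # returned so later cells can use cv_results_ and predict

def confusionMatrix(grid_search, inputs, labels):
    # Predict with the best estimator, plot the confusion matrix, return predictions.
    pred = grid_search.best_estimator_.predict(inputs)
    ConfusionMatrixDisplay(confusion_matrix(labels, pred)).plot()
    plt.show()
    return pred

def generateMetrics(labels, pred):
    # Weighted averages, since this is a 7-class problem.
    print("Accuracy: ", accuracy_score(labels, pred))
    print("Precision: ", precision_score(labels, pred, average='weighted'))
    print("Recall: ", recall_score(labels, pred, average='weighted'))
    print("F1: ", f1_score(labels, pred, average='weighted'))
```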
grid_visualization = grid_search.cv_results_['mean_test_score']
grid_visualization.shape = (8, 8)
sb.heatmap(grid_visualization, cmap='Blues', annot=True)
plt.xticks(np.arange(8) + 0.5, grid_search.param_grid['max_features'])
plt.yticks(np.arange(8) + 0.5, grid_search.param_grid['max_depth'])
plt.xlabel('max_features')
plt.ylabel('max_depth')
Text(33.0, 0.5, 'max_depth')
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'],
'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],
'max_features': [1, 2, 3, 4, 5, 6, 7, 8]}
cross_validation = StratifiedKFold(n_splits=10)
grid_search=gridSearchScore(decision_tree_classifier, parameter_grid,cross_validation, labels, inputs)
Best score: 0.9279365079365081
Best parameters: {'criterion': 'gini', 'max_depth': 8, 'max_features': 6, 'splitter': 'best'}
pred=confusionMatrix(grid_search, inputs,labels)
# Generate metrics
generateMetrics(labels, pred)
Accuracy: 0.9619047619047619
Precision: 0.9627183059917641
Recall: 0.9619047619047619
F1: 0.9620153608466987
# Using default values of this classifier
knn = neighbors.KNeighborsClassifier()
scores = cross_val_score(knn, inputs, labels, cv=10)
print("kNN: ",scores.mean())
kNN: 0.6622222222222223
# Using grid search to test which are the best parameters
parameter_grid = {"n_neighbors": [2,3,4,5,6],
"weights": ['uniform'],
"algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute'],
"p": [1, 2, 3]}
cross_validation = StratifiedKFold(n_splits=10)
knn = neighbors.KNeighborsClassifier()
grid_search=gridSearchScore(knn,parameter_grid,cross_validation,labels,inputs)
Best score: 0.7434920634920635
Best parameters: {'algorithm': 'auto', 'n_neighbors': 3, 'p': 1, 'weights': 'uniform'}
pred=confusionMatrix(grid_search, inputs, labels)
# Generate metrics
generateMetrics(labels,pred)
Accuracy: 0.8857142857142857
Precision: 0.8859954230897298
Recall: 0.8857142857142857
F1: 0.885531123953679
# Using default values of this classifier
gnb = GaussianNB()
scores = cross_val_score(gnb, inputs, labels, cv=10)
print("NB: ",scores.mean())
NB: 0.7863492063492064
# Using grid search to test which are the best parameters
parameter_grid = {"var_smoothing": [1e-9,1e-10,1e-11,1e-12,1e-13,1e-14,1e-15,1e-16,1e-17,1e-18]}
cross_validation = StratifiedKFold(n_splits=10)
gnb = GaussianNB()
grid_search=gridSearchScore(gnb,parameter_grid,cross_validation,labels,inputs)
Best score: 0.9292063492063493
Best parameters: {'var_smoothing': 1e-16}
pred=confusionMatrix(grid_search, inputs, labels)
generateMetrics(labels,pred)
Accuracy: 0.9298412698412698
Precision: 0.9317752107372169
Recall: 0.9298412698412698
F1: 0.9299358390690126
# Using default values of this classifier
svm = SVC()
scores = cross_val_score(svm, inputs, labels, cv=10)
print("SVC:",scores.mean())
SVC: 0.6399999999999999
# Using grid search to test which are the best parameters
parameter_grid = {"kernel": ['rbf'],#poly
"degree": [1, 2, 3, 4, 5],
"gamma": ['auto', 'scale']}
cross_validation = StratifiedKFold(n_splits=10)
svm = SVC()
grid_search=gridSearchScore(svm,parameter_grid,cross_validation,labels,inputs)
Best score: 0.6399999999999999
Best parameters: {'degree': 1, 'gamma': 'scale', 'kernel': 'rbf'}
pred=confusionMatrix(grid_search,inputs,labels)
# Generate metrics
generateMetrics(labels,pred)
Accuracy: 0.6412698412698413
Precision: 0.6387116128124439
Recall: 0.6412698412698413
F1: 0.6363859655804733
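SVC with an RBF kernel is sensitive to feature scale, and the attributes here span very different ranges (e.g. Area vs. ShapeFactor2), which likely explains the low score. Standardizing the inputs inside a pipeline (not part of the original analysis, shown here only as a sketch) usually helps distance-based models such as SVC, kNN and the MLP:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize each feature to zero mean / unit variance before the SVM.
scaled_svc = make_pipeline(StandardScaler(), SVC())

# Usage with the arrays defined earlier in the notebook:
# scores = cross_val_score(scaled_svc, inputs, labels, cv=10)
```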
mlp = MLPClassifier()
scores = cross_val_score(mlp, inputs, labels, cv=10)
print("MLP: ",scores.mean())
MLP: 0.306031746031746
# Using grid search to test which are the best parameters
parameter_grid = {"solver": ['adam'],
"hidden_layer_sizes": [8, 9, 10, 11],
"alpha": [1e-8, 1e-9, 1e-10, 1e-11],
"max_iter": [10000],
"random_state": [1]}
cross_validation = StratifiedKFold(n_splits=10)
mlp = MLPClassifier()
grid_search=gridSearchScore(mlp,parameter_grid,cross_validation,labels,inputs)
Best score: 0.41492063492063486
Best parameters: {'alpha': 1e-11, 'hidden_layer_sizes': 11, 'max_iter': 10000, 'random_state': 1, 'solver': 'adam'}
pred=confusionMatrix(grid_search,inputs,labels)
# Generate metrics
generateMetrics(labels,pred)
Accuracy: 0.4247619047619048
Precision: 0.475922773945182
Recall: 0.4247619047619048
F1: 0.3953586033645202
With this project we concluded that implementing a Machine Learning technique, in our case Supervised Learning, requires going through many steps in order to obtain an acceptable model.
First we went through an analysis phase, which is always necessary to build a good model: as the saying goes, "a bad dataset leads to bad models". To avoid that we removed outliers, duplicates and null values, which gave us a more reliable and accurate dataset. To make the tests uniform we sampled our data, since the number of samples varied widely across bean types (the classes were not balanced), which can lead to a known problem where the reported score is not realistic: a model could reach almost 100% accuracy while, in reality, simply predicting the most common class for every sample. To avoid overfitting we used k-fold cross validation, which splits the dataset into k different folds; this lets all the data be used for both training and testing, so the model behaves well on both.
To obtain the best classifier for each technique, it was necessary to do some research on their corresponding parameters. We then used grid search, which finds the best combination of those parameters and returns the best model possible.
To evaluate the performance of each model we used several metrics: accuracy, precision, recall and F-measure. Accuracy tells us the percentage of predictions the model got right; precision is the fraction of samples predicted as a class that actually belong to it; recall is the fraction of samples of a class that were correctly identified; and F-measure combines precision and recall into a single score, taking both false positives and false negatives into account. Looking at the results across all the classification techniques mentioned, we verified that the Decision Tree and Naive Bayes classifiers obtained the best results, with scores above 90%.
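The four metrics described above can be checked against a tiny hand-worked example. With two bean types and one mistake (a SEKER predicted as HOROZ), SEKER has precision 1 and recall 0.5, HOROZ has precision 2/3 and recall 1, and the weighted averages follow; sklearn reproduces the by-hand values:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ['SEKER', 'SEKER', 'HOROZ', 'HOROZ']
y_pred = ['SEKER', 'HOROZ', 'HOROZ', 'HOROZ']

# 3 of 4 predictions are correct -> accuracy 0.75
print(accuracy_score(y_true, y_pred))
# Weighted averaging, as used for the 7-class results above.
print(precision_score(y_true, y_pred, average='weighted'))  # (1 + 2/3) / 2 = 5/6
print(recall_score(y_true, y_pred, average='weighted'))     # (0.5 + 1) / 2 = 0.75
print(f1_score(y_true, y_pred, average='weighted'))
```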
Group 36: